bridge network
MCDDPM: Multichannel Conditional Denoising Diffusion Model for Unsupervised Anomaly Detection in Brain MRI
Trivedi, Vivek Kumar, Sharma, Bheeshm, Balamurugan, P.
Detecting anomalies in brain MRI scans using supervised deep learning methods presents challenges due to anatomical diversity and labor-intensive requirement of pixel-level annotations. Generative models like Denoising Diffusion Probabilistic Model (DDPM) and their variants like pDDPM, mDDPM, cDDPM have recently emerged to be powerful alternatives to perform unsupervised anomaly detection in brain MRI scans. These methods leverage frame-level labels of healthy brains to generate healthy tissues in brain MRI scans. During inference, when an anomalous (or unhealthy) scan image is presented as an input, these models generate a healthy scan image corresponding to the input anomalous scan, and the difference map between the generated healthy scan image and the original anomalous scan image provide the necessary pixel level identification of abnormal tissues. The generated healthy images from the DDPM, pDDPM and mDDPM models however suffer from fidelity issues and contain artifacts that do not have medical significance. While cDDPM achieves slightly better fidelity and artifact suppression, it requires huge memory footprint and is computationally expensive than the other DDPM based models. In this work, we propose an improved version of DDPM called Multichannel Conditional Denoising Diffusion Probabilistic Model (MCDDPM) for unsupervised anomaly detection in brain MRI scans. Our proposed model achieves high fidelity by making use of additional information from the healthy images during the training process, enriching the representation power of DDPM models, with a computational cost and memory requirements on par with DDPM, pDDPM and mDDPM models. Experimental results on multiple datasets (e.g. BraTS20, BraTS21) demonstrate promising performance of the proposed method. The code is available at https://github.com/vivekkumartri/MCDDPM.
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Factor-Conditioned Speaking-Style Captioning
Ando, Atsushi, Moriya, Takafumi, Horiguchi, Shota, Masumura, Ryo
This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based training, and with GtS, it generates more diverse captions while keeping style prediction performance.
On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization
Armengol-Estapé, Jordi, Michalski, Vincent, Kumar, Ramnath, St-Charles, Pierre-Luc, Precup, Doina, Kahou, Samira Ebrahimi
Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that cross-modal learning can improve representations for few-shot classification. More specifically, language is a rich modality that can be used to guide visual learning. In this work, we experiment with a multi-modal architecture for few-shot learning that consists of three components: a classifier, an auxiliary network, and a bridge network. While the classifier performs the main classification task, the auxiliary network learns to predict language representations from the same input, and the bridge network transforms high-level features of the auxiliary network into modulation parameters for layers of the few-shot classifier using conditional batch normalization. The bridge should encourage a form of lightweight semantic alignment between language and vision which could be useful for the classifier. However, after evaluating the proposed approach on two popular few-shot classification benchmarks we find that a) the improvements do not reproduce across benchmarks, and b) when they do, the improvements are due to the additional compute and parameters introduced by the bridge network. We contribute insights and recommendations for future work in multi-modal meta-learning, especially when using language representations.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Quebec > Montreal (0.14)
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)
- Europe > Belgium (0.04)
An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Hono, Yukiya, Mitsuda, Koh, Zhao, Tianyu, Mitsui, Kentaro, Wakatsuki, Toshiaki, Sawada, Kei
Advances in machine learning have made it possible to perform various text and speech processing tasks, including automatic speech recognition (ASR), in an end-to-end (E2E) manner. Since typical E2E approaches require large amounts of training data and resources, leveraging pre-trained foundation models instead of training from scratch is gaining attention. Although there have been attempts to use pre-trained speech and language models in ASR, most of them are limited to using either. This paper explores the potential of integrating a pre-trained speech representation model with a large language model (LLM) for E2E ASR. The proposed model enables E2E ASR by generating text tokens in an autoregressive manner via speech representations as speech prompts, taking advantage of the vast knowledge provided by the LLM. Furthermore, the proposed model can incorporate remarkable developments for LLM utilization, such as inference optimization and parameter-efficient domain adaptation. Experimental results show that the proposed model achieves performance comparable to modern E2E ASR models.
Traversing Between Modes in Function Space for Fast Ensembling
Yun, EungGu, Lee, Hyungi, Nam, Giung, Lee, Juho
Deep ensemble is a simple yet powerful way to improve the performance of deep neural networks. Under this motivation, recent works on mode connectivity have shown that parameters of ensembles are connected by low-loss subspaces, and one can efficiently collect ensemble parameters in those subspaces. While this provides a way to efficiently train ensembles, for inference, multiple forward passes should still be executed using all the ensemble parameters, which often becomes a serious bottleneck for real-world deployment. In this work, we propose a novel framework to reduce such costs. Given a low-loss subspace connecting two modes of a neural network, we build an additional neural network that predicts the output of the original neural network evaluated at a certain point in the low-loss subspace. The additional neural network, which we call a "bridge", is a lightweight network that takes minimal features from the original network and predicts outputs for the low-loss subspace without forward passes through the original network. We empirically demonstrate that we can indeed train such bridge networks and significantly reduce inference costs with the help of bridge networks.
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Asia > South Korea > Daejeon > Daejeon (0.04)
Learning to Summarize and Answer Questions about a Virtual Robot's Past Actions
DeChant, Chad, Akinola, Iretiayo, Bauer, Daniel
When robots perform long action sequences, users will want to easily and reliably find out what they have done. We therefore demonstrate the task of learning to summarize and answer questions about a robot agent's past actions using natural language alone. A single system with a large language model at its core is trained to both summarize and answer questions about action sequences given ego-centric video frames of a virtual robot and a question prompt. To enable training of question answering, we develop a method to automatically generate English-language questions and answers about objects, actions, and the temporal order in which actions occurred during episodes of robot action in the virtual environment. Training one model to both summarize and answer questions enables zero-shot transfer of representations of objects learned through question answering to improved action summarization. % involving objects not seen in training to summarize.
- Research Report (0.64)
- Workflow (0.55)
Bridge Networks
Olin-Ammentorp, Wilkie, Bazhenov, Maxim
Despite rapid progress, current deep learning methods face a number of critical challenges. These include high energy consumption, catastrophic forgetting, dependance on global losses, and an inability to reason symbolically. By combining concepts from information bottleneck theory and vector-symbolic architectures, we propose and implement a novel information processing architecture, the 'Bridge network.' We show this architecture provides unique advantages which can address the problem of global losses and catastrophic forgetting. Furthermore, we argue that it provides a further basis for increasing energy efficiency of execution and the ability to reason symbolically.
- North America > United States > Texas > El Paso County > El Paso (0.05)
- North America > United States > California > San Diego County > San Diego (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Motion-Attentive Transition for Zero-Shot Video Object Segmentation
Zhou, Tianfei, Wang, Shunzhou, Zhou, Yi, Yao, Yazhou, Li, Jianwu, Shao, Ling
In this paper, we present a novel Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation, which provides a new way of leveraging motion information to reinforce spatio-temporal object representation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder, which transforms appearance features into motion-attentive representations at each convolutional stage. In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance. This is superior to the typical two-stream architecture, which treats motion and appearance separately in each stream and often suffers from overfitting to appearance information. Additionally, a bridge network is proposed to obtain a compact, discriminative and scale-sensitive representation for multi-level encoder features, which is further fed into a decoder to achieve segmentation results. Extensive experiments on three challenging public benchmarks (i.e. DAVIS-16, FBMS and Youtube-Objects) show that our model achieves compelling performance against the state-of-the-arts.
- Asia > China > Beijing > Beijing (0.04)
- Asia > Middle East > UAE (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Africa > Ghana > Gulf of Guinea (0.04)
Brigitte, a Bridge-Based Grid Path-Finder
We present BRIGITTE, a new path-finding algorithm for 8-connected grids. It is based on the notion bridge that we define here, i.e., a high-level description of paths between all pairs of points from two convex regions that allows fast distance query and fast generation of the prefix and suffix of these paths. BRIGITTE uses a pre-processing step to first partition the map into convex regions and then compute a sufficient set of bridges between every pair of regions. Path-finding is then performed by looking up the regions of the source and target cells and then iterating over the bridges of the pair of regions to determine which one yields the shortest path. BRIGITTE competes favourably compared to CH-SG-R and Copp, although this currently comes at a price of an extensive pre-processing.